We would like to thank our professor, Ms. Dahlin, for her constant encouragement and support throughout this project.
Formula 1 is a racing sport and the pinnacle of mechanical engineering where 10 teams (constructors) compete for the world championship. Our dataset comprises of data from 1950, the time that Formula 1 started, to 2017. As in any typical sport, the world championships of Formula 1 racing are decided by a number of parameters. Some of the parameters are qualifying race times (qualifying races usually take place before the main race to determine the starting positions of the cars for the main race) , lap times ( the time required by drivers to lap around a track once) and most importantly pit stop times (when the tires of the cars are changed, the car is refueled, any broken parts are replaced).
This is interesting because the person who is in front of the grid (a grid is the positioning of the cars from the start line of the race) may not necessarily win. Teams can follow a variety of different strategies, i.e., perform an undercut, overcut or push when required, in order to gain an advantage and win in the final stages of a race.
This variation of data can be analyzed and visualized in numerous ways. Formula 1, being the bleeding edge of mechanical engineering provides a framework for the cars that are going to be designed in the future (usually the technology that is used in formula 1 cars, take a lot of time to be incorporated into normal cars) and therefore this analysis is significant.
Questions of Interest
â How many drivers who did not start the race in the first position went on to win?
â Who is the driver who has won the most number of formula 1 championships in history?
â How has the pitstop times changed over the years?
â How many times has a constructor(teams) won at that specific venue over the years?
Expected findings
â Our analysis indicates that over a period of time, there is a significant reduction in the pitstop time.
â The driver who has the maximum number of race wins, points and championships is Michael Schumacher.
The following information provides the Data mapping used in this project
Circuits: ++
circuitId - Use as key
circuitRef
name
location
country â Useful to specify by country
lat - latitude
lng - longitude
alt
url
Constructors: ++
constructorId â use as key
constructorRef
name
nationality
url
Drivers: ++
driverId â Use as Key
driverRef
number â Fill blanks with -1
code
forename
surname
dob
nationality
url
Races: ++
raceId â Use as Key
year â Season of the race
round â Race number in that year
circuitId â Same as Circuitid from circuit
name
date
time â Fill blanks with 10:00
url
Weather
Results: +++
resultId
raceId â Use key from Races
driverId â Use key from Drivers
constructorId â use key storm constructors
number â Driver number(5 for vettel, Not unique). Fill Blanks with -1
grid â Starting grid position
position â Finishing position. Fill blanks with -1 for retired cards
positionText â Position Text. R stands for retired
positionOrder â Final position order ( This ideally should be the same as position 1âŚn where n is
number of cars that took part in that race)
points â Points won
laps â Number of laps driven in a race. Any driver that retired will have a lower number than those
that finished for a specific RaceID
time
milliseconds
fastestLap â Gives fastest lap number
rank â Rank of the fastest lap times
fastestLapTime â Lap time
fastestLapSpeed â Avg speed in the fastest lap
statusId â StatusId of the car at the end of the race. Use key from Status ( Finished. Did not finish.
Disqualified. Power unit issues. Etc.)
Seasons: +
Year - year
url
Status - ++
statusId â Use this as a key
status â Status of the car at the end of the race and the reason for retiring.
Constructor results: +
constructorResultsId â Use as Key
raceId - Use key from Races
constructorId â Use key from Constructors
points â Points scored
status â Null/ D (D for disqualified)
Constructor Standings:+
constructorStandingsId
raceId â Use key from Races
constructorId â Use key from Constructors
points â Points scored â Fill Blanks with 0
position â Position of constructor
positionText - Text
wins â Wins in that year
Number of '+' indicates the importance of the data set
+: foundation data table
++: Important data table
+++ : Critical data table
# Importing required modules
import pandas as pd
import numpy as np
# Reading the 'driverStandings.csv' file and creating a copy of it.
driverStandings = pd.read_csv('driverStandings.csv')
driverStandingsCopy = driverStandings.copy()
# Renaming its columns
driverStandingsCopy.rename(columns = {'driverStandingsId':'driverStandingsId', 'raceId':'raceId', 'driverId':'driverId',
'points':'Points', 'position':'Position',
'positionText':'positionText', 'wins':'Wins'}, inplace = True)
# Setting the index to 'driverStandingsId'
driverStandingsCopy.set_index('driverStandingsId',inplace=True)
driverStandingsCopy.head()
# Reading the 'results.csv' file and creating a copy of it.
results = pd.read_csv('results.csv')
resultsCopy= results.copy()
# Renaming its columns
resultsCopy.rename(columns = {'resultId':'resultId', 'raceId':'raceId', 'driverId':'driverId',
'constructorId':'constructorId', 'number':'Number', 'grid': 'Grid',
'position':'Position', 'positionText':'positionText', 'positionOrder': 'Position order',
'points':'Points', 'laps':'Laps', 'time':'Time','milliseconds':'Time in milliseconds',
'fastestLap': 'Fastest Lap', 'rank': 'Rank', 'fastestLapTime': 'Fastest Lap Time',
'fastestLapSpeed':'Fastest Lap Speed', 'statusId':'statusId'}, inplace = True)
resultsCopy.set_index('resultId',inplace=True)
#Handling null values for multiple columns in 'results.csv'
values = {'Position': -1, 'Number': -1, 'Time': 0, 'Time in milliseconds': 0}
resultsCopy.fillna(value=values).head()
# Opening the circuits.csv file
# It is a TextIO Wrapper file
circuitsWrapper = open('circuits.csv', encoding = "ISO-8859-1")
circuits = pd.DataFrame(circuitsWrapper)
# Appending the valuesinto a list
list_circuits=[]
# Since the values are comma seperated and stored in a list, we split them
rowLabels_circuits=circuits[0][0].split(',')
# Looping through the length of the DataFrame
for i in range(1,74):
values=circuits[0][i].split(',')
list_circuits.append(values)
# Storing the newly appended list in a DataFrame
circuitsUpdated=pd.DataFrame(list_circuits)
# Creating a copy of the above created DataFrame
circuitsCopy= circuitsUpdated.copy()
# Renaming the columns
circuitsCopy.rename(columns = {0:'circuitId', 1:'CircuitRef', 2:'Name', 3:'Location',
4:'country', 5:'Latitude', 6:'Longitude', 7:'Altitude', 8:'URL'}, inplace = True)
# Setting the index as 'circuitId'
circuitsCopy.set_index('circuitId',inplace=True)
# # Dropping redundant columns
circuitsCopy.drop(['Altitude', 9], axis=1, inplace=True)
circuitMap=circuitsCopy
circuitsCopy.head()
# Reading the 'constructorResults.csv' file and creating a copy of it.
constructorResults = pd.read_csv('constructorResults.csv')
constructorResultsCopy= constructorResults.copy()
# Renaming its columns
constructorResultsCopy.rename(columns = {'constructorResultsId':'constructorResultsId', 'raceId':'raceId',
'constructorId':'constructorId', 'points':'Points',
'status':'Status', 5:'Latitude', 6:'Longitude', 7:'Altitude', 8:'URL'}, inplace = True)
# Setting the index
constructorResultsCopy.set_index('constructorResultsId', inplace=True)
#Handling null values
values = {'Status': 'NULL'}
constructorResultsCopy.fillna(value=values).head()
# Reading the 'constructors.csv' file and creating a copy of it.
constructors = pd.read_csv('constructors.csv')
constructorCopy= constructors.copy()
constructorCopy.head()
# Renaming the columns
constructorCopy.rename(columns = {'constructorId':'constructorId', 'constructorRef':'ConstructorRef',
'name':'Name', 'nationality':'Nationality', 'url':'URL',
'Unnamed: 5':'Unnamed: 5'}, inplace = True)
constructorCopy.head()
# Reading the 'constructorStandings.csv' file and creating a copy of it.
constructorStandings = pd.read_csv('constructorStandings.csv')
constructorStandingsCopy = constructorStandings.copy()
# Renaming the columns
constructorStandingsCopy.rename(columns = {'constructorStandingsId':'constructorStandingsId', 'raceId':'raceId',
'constructorId':'constructorId', 'points':'Points', 'position':'Position',
'positionText':'positionText','wins':'Wins', 'Unnamed: 7':'Unnamed: 7'}, inplace = True)
# Setting the index
constructorStandingsCopy.set_index('constructorStandingsId',inplace=True)
constructorStandingsCopy.head()
# Opeing the 'drivers.csv' TextIOWrapper file
driversWrapper = open('drivers.csv', encoding = "ISO-8859-1")
drivers = pd.DataFrame(driversWrapper)
# Creating a list
list_drivers=[]
# Splitting the comma seperated values
rowLabels_drivers=drivers[0][0].split(',')
# Looping through the length of the DataFrame
for i in range(1,843):
values=drivers[0][i].split(',')
# Appending the values to the list
list_drivers.append(values)
# Creating a new DataFrame by using the list
driversUpdated=pd.DataFrame(list_drivers)
# Creating a copy of the DataFrame
driversCopy= driversUpdated.copy()
# Renaming the columns
driversCopy.rename(columns = {0:'driverId', 1:'DriverRef', 2:'Number', 3:'Code',
4:'Forename', 5:'Surname', 6:'DOB', 7:'Nationality', 8:'URL'}, inplace = True)
# Setting the index
driversCopy.set_index('driverId',inplace=True)
# Dropping redundant column
driversCopy.drop([9], axis=1, inplace=True)
driversCopy.head()
# Opening the 'pitStops.csv' file and creating a copy of it after converting it to DataFrame from TextIOWrapper.
pitStopsWrapper = open('pitStops.csv', encoding = "ISO-8859-1")
pitStops = pd.DataFrame(pitStopsWrapper)
# Create a list
list_pitStops=[]
# Splitting the comma seperated values in the wrapper file
rowLabels_pitStops=pitStops[0][0].split(',')
# Looping through the length of the Dataframe
for i in range(1,6252):
values=pitStops[0][i].split(',')
# Appending the values to the list
list_pitStops.append(values)
# Creating a new DataFrame and adding the list to it
pitStopsUpdated=pd.DataFrame(list_pitStops)
# Creating a copy of the DataFrame
pitStopsCopy= pitStopsUpdated.copy()
# Renaming the columns
pitStopsCopy.rename(columns = {0:'raceId', 1:'driverId', 2:'Stop', 3:'Lap',
4:'Time', 5:'Duration', 6:'Duration in milliseconds'}, inplace = True)
# Setting the index
pitStopsCopy.set_index('raceId',inplace=True)
pitStopsCopy.head()
# Reading the 'races.csv' file and creating a copy of it
races = pd.read_csv('races.csv')
racesCopy = races.copy()
# Renaming the columns
racesCopy.rename(columns = {'raceId':'raceId', 'year':'Year','round':'Round', 'circuitId':'circuitId', 'name':'Name',
'date':'Date','time':'Time', 'url':'URL'}, inplace = True)
# Setting the index
racesCopy.set_index('raceId',inplace=True)
race2=racesCopy
racesCopy.head()
# Reading the 'seasons.csv' file and creating a copy of it
seasons = pd.read_csv('seasons.csv')
seasonsCopy= seasons.copy()
# Renaming the columns
seasonsCopy.rename(columns = {'year':'Year','url':'URL'}, inplace = True)
# Setting the index
seasonsCopy.set_index('Year',inplace=True)
seasonsCopy.head()
# Reading the 'status.csv' file and creating a copy of it
status = pd.read_csv('status.csv')
statusCopy = status.copy()
# Renaming the columns
statusCopy.rename(columns = {'statusId':'statusId','status':'Status'}, inplace = True)
# Setting the index
statusCopy.set_index('statusId',inplace=True)
statusCopy.head()
print("Number of winners: "+str(len(results[results['position'] == 1.0])))
resultsCopy=resultsCopy[resultsCopy['Position'] == 1.0]
resultsCopy=resultsCopy[resultsCopy['Grid'] != 1]
print("Number of winners who were not in grid position 1: "+str(len(resultsCopy)))
required_drivers = resultsCopy[resultsCopy['Grid'] > 3].sort_values('Grid',ascending=False).drop_duplicates(subset='Grid',keep='first')
required_drivers
required_drivers = resultsCopy[resultsCopy['Grid'] > 15].sort_values('Grid',ascending=False).drop_duplicates(subset='Grid',keep='first')
required_drivers
df=pd.DataFrame()
df = [ driversCopy[driversCopy.index == j] for i in required_drivers['driverId'] for j in driversCopy.index if int(j) == i]
#df
drivers_url=[]
for i in range(len(df)):
drivers_url.append(df[i].URL)
from bs4 import BeautifulSoup
import requests
from IPython.display import Image
path = "/Users/prati/FormulaOne/Images/"
Image(filename = path + "Watson.jpg", width=250, height=250)
Image(filename = path + "Cross.png", width=250, height=250)
Image(filename = path + "Barrichello.png", width=250, height=250)
Image(filename = path + "Raikkonen.png", width=250, height=250)
Image(filename = path + "Schumi.png", width=250, height=250)
pit_time_list1 = pd.DataFrame(pitStopsCopy.loc['881']['Duration'])
mean_list_for_a_race1 = pit_time_list1['Duration'].astype(float).mean()
#mean_list_for_a_race1
pd.set_option("display.max_rows", 100)
pd.set_option("display.max_columns", 99)
driverStandingsCopy = driverStandingsCopy.fillna(0)
driver_wins = {}
for index, row in driverStandingsCopy.iterrows():
if row["Position"] == 1:
if row["driverId"] in driver_wins.keys():
driver_wins[row["driverId"]] += 1
else:
driver_wins[row["driverId"]] = 1
#driver_wins
driverStandingsDf= pd.DataFrame(driverStandingsCopy)
driversDf= pd.DataFrame(driversCopy)
driverStandingsDf=driverStandingsDf.sort_values('driverId')
driversCopy.head()
driversCopy.reset_index(inplace=True)
driversCopy['driverId']=driversCopy['driverId'].astype('int')
data=pd.merge(driverStandingsCopy, driversCopy, on='driverId', how='left')
#data
data.groupby(['DriverRef']).count().sort_values(by='Wins', ascending=False, inplace=True)
#data
data = data[data["Position"]==1]
data["Surname"].value_counts()
import matplotlib.pyplot as plt
import plotly.graph_objs as go
import chart_studio
import chart_studio.plotly as py
colors = ['grey'] * 55
colors[0] = 'crimson'
chart_studio.tools.set_credentials_file(username='pratikp', api_key='AHxozWNTaluAU7sx4YW7')
driverWins=[go.Bar(x=data["Surname"].value_counts().index, y=data["Surname"].value_counts().values, marker_color=colors)]
py.iplot(driverWins,filename='Wins by Drivers')
pit_stops_average_df = pd.DataFrame(columns=['Year', 'Average_pit_time'])
for year in range(2011,2018):
race_list_for_year = racesCopy[racesCopy['Year']==int(year)].index.unique()
mean_list_for_a_race = []
for race in race_list_for_year:
try:
pit_time_list = pd.DataFrame(pitStopsCopy.loc[str(race)]['Duration'])
mean_list_for_a_race += [pit_time_list['Duration'].astype(float).mean()]
except:
print(race)
mean_list_for_a_race
mean_dfrow_for_a_specific_year = pd.DataFrame(mean_list_for_a_race)
pit_stops_average_df = pit_stops_average_df.append({'Year':int(year),'Average_pit_time': float(mean_dfrow_for_a_specific_year.mean())}, ignore_index=True)
pit_stops_average_df
pit_time_list = pd.DataFrame(pitStopsCopy.loc['969']['Duration'])
meanPitTime = pit_time_list['Duration'].astype(float).mean()
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("whitegrid")
sns.lineplot(x=pit_stops_average_df.Year, y=pit_stops_average_df.Average_pit_time);
plt.ylabel('Time in seconds')
plt.title('Pitstop times over the years')
racesCopy['circuitId']=racesCopy['circuitId'].astype('int')
racesCopy=racesCopy.reset_index()
circuitsCopy=circuitsCopy.reset_index()
circuitsCopy['circuitId']=circuitsCopy['circuitId'].astype('int')
dataCirc=pd.merge(circuitsCopy, racesCopy, on='circuitId', how='left')
dataCirc[['circuitId','Name_x','Location','country','raceId']]
constructorStandingsCopy=constructorStandingsCopy.reset_index()
dataCirc2=pd.merge(dataCirc, constructorStandingsCopy, on='raceId', how='left')
pd.set_option("display.max_rows", 12000)
#data3.head()
dataCirc2[dataCirc2['Position']==1.0].head()
#dataCircConst=dataCirc2.groupby(['circuitId','constructorId']).count()
dataCircConst=dataCirc2.groupby(['circuitId','constructorId']).count().sort_values(by='Wins',ascending=False)
dataCircConst_copy=dataCircConst
dataCircConst_copy_index=dataCircConst_copy.reset_index()
dataCricConstDf=pd.DataFrame(dataCircConst_copy_index.groupby(by='circuitId')['Wins'].max())
mergedDf_CircConst=pd.merge(dataCircConst_copy_index, dataCricConstDf, on=['circuitId','Wins'], how='inner')
mergedDf_CircConst_filtered=mergedDf_CircConst[['circuitId','constructorId','Wins']]
mergedDf_CircConst_filtered.sort_values(by=['circuitId','constructorId'],ascending=True).head()
mergedDf_CircConst_merged=pd.merge(mergedDf_CircConst_filtered, circuitsCopy, on='circuitId', how='left')
for index, row in mergedDf_CircConst_merged.iterrows():
s = constructorCopy[constructorCopy['constructorId']==row['constructorId']]['Name']
mergedDf_CircConst_merged.at[index,'constructorname']=(s.to_string(index=False))
fig = go.Figure(data=[
go.Bar(name='Wins', x=mergedDf_CircConst_merged.CircuitRef, y=mergedDf_CircConst_merged.Wins, text = mergedDf_CircConst_merged.constructorname, textposition='auto', hovertext=mergedDf_CircConst_merged.constructorname),
# go.Bar(name='6:30PM Section', x=candy_df[candy_df.Section == '6:30PM'].Candy, y=candy_df[candy_df.Section == '6:30PM'].Count),
])
fig.update_layout(barmode='group')
fig.show()
res=results
race=racesCopy
values = {'position': -1, 'number': -1, 'time': 0, 'milliseconds': 0}
res=res.fillna(value=values)
dataMergeRaceResult=pd.merge(res, race, on='raceId', how='left')
dataMergeRaceResult=dataMergeRaceResult[dataMergeRaceResult['position'] == -1.0]
dataResRaceGrp=dataMergeRaceResult.groupby(['circuitId']).count().sort_values(by='raceId',ascending=False)
dataResRaceGrp=dataResRaceGrp.reset_index()
dataResRaceGrp[['circuitId','number']].head()
dataResRaceCircMerged=pd.merge(dataResRaceGrp, circuitsCopy, on='circuitId', how='left')
dataResRaceCircMerged[['circuitId','number','Latitude','Longitude']].head()
fig = go.Figure(go.Densitymapbox(lat=dataResRaceCircMerged.Latitude, lon=dataResRaceCircMerged.Longitude, z=dataResRaceCircMerged.number, radius=35))
fig.update_layout(mapbox_style="stamen-terrain", mapbox_center_lon=180)
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig.show()
circuitMap['Raceloc'] = 0
circuit_id_list = list(race2[race2['Year'] == 2018]['circuitId'])
circuit_id_list
circuit_id_list_notlive_i = list(race2[race2['Year'] != 2018]['circuitId'])
circuit_id_list_notlive = [x for x in circuit_id_list_notlive_i if x not in circuit_id_list]
# len(circuit_id_list_notlive)
# len(circuit_id_list)
# circuitsCopy[~circuitsCopy['circuitId'].isin(circuit_id_list)]['Raceloc'] = 1
circuit_id_list
# circuit_id_list
for rowId in circuit_id_list:
circuitMap.loc[str(rowId),'Raceloc'] = 1
circuitMap.head()
import seaborn as sns
import plotly.graph_objects as go
mapbox_access_token = 'pk.eyJ1Ijoic3VtdWtoc2hhcm1hIiwiYSI6ImNrMm92bjJydzE3c3UzcHQ1Nzdjb3JwNHMifQ.x_ybvMGOCORXoXDYqnJCCg'
class_map_data1 = go.Scattermapbox(
lon = circuitMap[circuitMap['Raceloc']!=1]['Longitude'],
lat = circuitMap[circuitMap['Raceloc']!=1]['Latitude'],
text = circuitMap[circuitMap['Raceloc']!=1]['Name'],
hoverinfo='text',
mode = 'markers',
marker = dict(
color = 'red',
symbol = 'circle',
opacity = .5
),
name = "Not live circuits"
)
class_map_data2 = go.Scattermapbox(
lon = circuitMap[circuitMap['Raceloc']==1]['Longitude'],
lat = circuitMap[circuitMap['Raceloc']==1]['Latitude'],
text = circuitMap[circuitMap['Raceloc']==1]['Name'],
hoverinfo='text',
mode = 'markers',
marker = dict(
color = 'green',
symbol = 'circle',
opacity = 1
),
name = "Live circuits"
)
class_map_layout = go.Layout(
title = 'Circuit Locations',
mapbox=go.layout.Mapbox(
accesstoken=mapbox_access_token,
zoom=1
)
)
class_map = go.Figure(data=[class_map_data1,class_map_data2], layout=class_map_layout)
class_map.show()
The possibility of a driver winning is a race is largely influenced by the position that he starts the race in. Less than two percent of all drivers who have started a race from positions 4 and above have gone on to win the race
Having won 121 races and 7 world championships, Michael Schumacher is the greatest driver of all time.
Pitstop times have decreased considerably over the years with a significant decrease of ~15 seconds from 2011 to 2012. As more rules are changed in the future, we expect to see these statistics change again.
There are a few circuits where one constructor has dominated over the years. However, there are other circuits where the top spot, in terms of number of wins, has been shared by different teams. Our circuits of interest mainly being Monaco and Monza, have a clear winning constructor, Ferrari, one of the most successful in the history of Formula 1.
Street circuits are dangerous to race on and have resulted in numerous retirements and fatal crashes over the years. The 'Circuit de Monaco' stands out as the most dangerous circuit in the world with 873 retirements. Monza circuit in Italy is the most dangerous non-street circuit due to its high speed straights and tricky corners.
Formula 1, having been organized in more than 70 circuits across different continents, is truly a global sport.
We have uploaded our project to GitHub and here is the navigation link for the same.
https://github.com/pratikp1681/FormulaOne/